Robustness Assessment of Images From a 0.35T Scanner of an Integrated MRI-Linac: Characterization of Radiomics Features in Phantom and Patient Data

Purpose: Radiomics entails the extraction of quantitative imaging biomarkers (or radiomics features) hypothesized to provide additional pathophysiological and/or clinical information compared to qualitative visual observation and interpretation. This retrospective study explores the variability of radiomics features extracted from images acquired with the 0.35 T scanner of an integrated MRI-Linac. We hypothesized we would be able to identify features with high repeatability and reproducibility over various imaging conditions using phantom and patient imaging studies. We also compared findings from the literature relevant to our results.
Methods: Eleven scans of a Magphan® RT phantom over 13 months and 11 scans of a ViewRay Daily QA phantom over 11 days constituted the phantom data. Patient datasets included 50 images from ten anonymized stereotactic body radiation therapy (SBRT) pancreatic cancer patients (50 Gy in 5 fractions). A True Fast Imaging with Steady-State Free Precession (TRUFI) pulse sequence was selected, using a voxel resolution of 1.5 mm × 1.5 mm × 1.5 mm and 1.5 mm × 1.5 mm × 3.0 mm for phantom and patient data, respectively. A total of 1087 shape-based, first, second, and higher order features were extracted followed by robustness analysis. Robustness was assessed with the Coefficient of Variation (CoV < 5%).
Results: We identified 130 robust features across the datasets. Robust features were found within each category, except for 2 second-order sub-groups, namely, Gray Level Size Zone Matrix (GLSZM) and Neighborhood Gray Tone Difference Matrix (NGTDM). Additionally, several robust features agreed with findings from other stability assessments or predictive performance studies in the literature.
Conclusion: We verified the stability of the 0.35 T scanner of an integrated MRI-Linac for longitudinal radiomics phantom studies and identified robust features over various imaging conditions. We conclude that phantom measurements can be used to identify robust radiomics features. More stability assessment research is warranted.


Introduction
Background
Image-guided radiation therapy (IGRT) has advanced considerably since the development and implementation of onboard cone-beam computed tomography (CBCT) systems. 1 Recently, radiation therapy systems with integrated MRI scanners have been introduced clinically, providing superior soft-tissue contrast compared to X-ray-based imaging. 2 In addition to RECIST and other similar protocols based on visible tumor measurements [https://recist.eortc.org/], computed tomography (CT), positron emission tomography (PET), and magnetic resonance imaging (MRI) images are qualitatively analyzed by radiologists as standard practice for screening, staging, or decision-making purposes. 3 Quantitative analysis, or radiomics, aims to extract additional information from these standard-of-care images, with the hypothesis that texture and voxel value distribution contain physiological information not discernible visually. 4 Images are converted into mineable data, generating so-called radiomics features (imaging biomarkers) relating to pathophysiological processes which, combined with other patient data, are hypothesized to provide predictive or discriminative information. [4][5][6] By combining qualitative and quantitative data, the long-term goal is to build reliable descriptive clinical models that tailor treatment to each patient and make oncology more personalized than it is today. [7][8][9]
Quantitative image analysis can be divided into several steps: image acquisition, segmentation, feature extraction, statistical analysis, and model building, each with unique challenges. 5,8,10 Features can be vulnerable to differences between and within image modalities, such as fundamental imaging physics, imaging parameters, reconstruction methods, segmentation method, and feature extraction software. [8][9][10][11][12] Comparison between institutions is therefore difficult, and the lack of standardized methodologies is a major challenge for radiomics to overcome before clinical translation. 5,8,9 Furthermore, models based on nonrobust features will likely not provide reliable predictions when applied prospectively to new data. 13 Although no standardized guidelines on how to assess feature robustness have been developed, robustness assessment is emphasized by the Image Biomarker Standardization Initiative (IBSI) 6 as a primary step in the feature selection process. 14 IBSI is an independent international collaboration aiming to establish common biomarker nomenclature and definitions for the radiomics community. Thus, identifying features that are robust under various imaging conditions is essential for developing clinical outcome prediction or clinical decision support systems. 5,14
ViewRay's MRIdian MRI-Linac (ViewRay Inc., Cleveland, OH) is a commercially available hybrid system combining a 0.35 T scanner with a 6 MV flattening-filter-free (FFF) medical electron linear accelerator. 15 This system provides a potentially advantageous setting in which images for radiomics analysis are acquired daily within the context of radiotherapy treatment. However, reliable approaches and robust radiomics features for this MRI-guided radiotherapy (MRIgRT) workflow remain to be determined. In this retrospective study, we investigated radiomics features in both phantom and patient images acquired with the scanner of such a system, with a primary focus on robustness assessment, that is, the repeatability and reproducibility of the system and the associated radiomics feature calculations. The aim was to explore longitudinal radiomics studies in invariant objects and to identify robust radiomics features across various imaging conditions. Additionally, a literature review of MRI-based radiomics studies, with emphasis on either robustness assessment or clinical correlations, was included in this work for comparison and to identify potential features fulfilling both the robustness and predictive criteria.

Literature Review
The main aim of this work was to investigate feature variability and perform a robustness assessment of the integrated MRI-Linac system in both phantom and patient data. A literature review was included to provide a comprehensive summary of similar available studies within MRI-based radiomics. The main literature collection took place between January and May 2020, but a few papers published later have also been included. Most of the literature 14,16-28 was found through the PubMed database using search terms such as "MRI radiomics," "MRI Linac radiomics," "MRI radiomics stability," and "Radiomics phantom study." Published studies with similar questions, aims, or other relevant findings regarding feature variability were included based on their relevance to our study. The primary goal was to characterize robust features in various imaging conditions. It is important to recall that feature robustness does not imply feature predictability or any other biomarker correlation to a clinical task or outcome. 14 A secondary goal of the literature review was therefore to identify common radiomics features demonstrating both high robustness and significant clinical correlation. Thus, a summary of the relevant papers included in the literature review can be seen in Tables 1 and 2, where the study purpose, feature classes, and robust/

Phantom Properties
The Magphan® RT Phantom (Figure 1) The ViewRay Daily QA Phantom (Figure 2) is a cylindrical phantom filled with distilled water. It has 1 central and 4 surrounding cavities for insertion of an ionization chamber. 31

Data Selection
Eleven scans of the Magphan® RT Phantom acquired over a 13-month period and eleven scans of the ViewRay Daily QA Phantom acquired over 11 workdays constituted the complete phantom dataset.
The Institutional Review Board at the University of South Florida approved (IRB #20383) and waived the informed consent requirement for retrospective analysis in this study. Patient data included 50 images from 10 anonymized stereotactic body radiation therapy (SBRT) pancreas cancer patients treated with 50 Gy in 5 consecutive daily fractions. The kidneys and liver were chosen to represent theoretically invariant objects in the patient, assuming no significant effect of radiation during the course of treatment and a consistent distance/orientation relative to the pancreatic target, thus ensuring a consistent location within the imaging coils. Both organs exhibit a desirable heterogeneity for radiomics studies and are therefore appropriate as a transition from ideal imaging conditions to more complex structures such as human tissue.
In summary, 4 datasets were included for statistical analysis of calculated radiomics features defined as follows: monthly phantom, daily phantom, patient kidney, and patient liver.

Image Acquisition and Registration
All phantom images were acquired using a torso coil and a high-resolution TRUFI pulse sequence with the following imaging parameters: 1.5 mm × 1.5 mm × 1.5 mm resolution, 500 mm × 449 mm × 432 mm Field of View (FOV), and 172 s total image acquisition time. Positioning and set-up were identical for every scanning occasion. All patient images were acquired using a torso coil and a TRUFI pulse sequence with 1.5 mm × 1.5 mm × 3.0 mm resolution, 540 mm × 465 mm × 432 mm FOV, and 25 s total imaging time (for faster imaging during treatment). Image export, import, segmentation, and registration were done in Mirada RTx (Mirada RTx 1.6, Mirada Medical, Oxford, UK).
Identical cylindrical 4.2 cm³ VOIs were contoured in different sections of both phantoms: 4 regions in the Magphan® RT Phantom (Figure 3) and 2 regions in the ViewRay Daily QA Phantom (Figure 4). All structures were propagated from the baseline to the remaining ten imaging sets by rigid registration in Mirada RTx. For each patient image, a spherical 14 cm³ liver VOI was placed in the midsection anteriorly/posteriorly, 4 cm caudally from the diaphragm, and 11 cm laterally from the aorta (Figure 5b), while the kidneys were manually segmented by a single user (Figure 5a).
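To give a sense of how many voxels contribute to each feature calculation, the following sketch converts the stated VOI volumes and voxel resolutions into approximate voxel counts. It assumes the nominal volumes (4.2 cm³ and 14 cm³) and ignores partial-volume effects at the VOI boundary; the function names are hypothetical and not part of the study's software.

```python
import math

def voxels_in_voi(volume_cm3, voxel_mm):
    """Approximate number of voxels in a VOI of the given nominal volume."""
    voxel_volume_mm3 = voxel_mm[0] * voxel_mm[1] * voxel_mm[2]
    return volume_cm3 * 1000.0 / voxel_volume_mm3  # 1 cm^3 = 1000 mm^3

def sphere_radius_cm(volume_cm3):
    """Radius of a sphere with the given volume."""
    return (3.0 * volume_cm3 / (4.0 * math.pi)) ** (1.0 / 3.0)

# Voxel sizes and VOI volumes as stated in the text.
phantom_voxels = voxels_in_voi(4.2, (1.5, 1.5, 1.5))
liver_voxels = voxels_in_voi(14.0, (1.5, 1.5, 3.0))
liver_radius = sphere_radius_cm(14.0)

print(f"Phantom VOI: ~{phantom_voxels:.0f} voxels")
print(f"Liver VOI:   ~{liver_voxels:.0f} voxels, radius ~{liver_radius:.2f} cm")
```

Under these assumptions, each phantom VOI contains on the order of 1200 voxels and each liver VOI on the order of 2000 voxels.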

Statistical Analysis
Traverso et al 3 divided feature robustness into 2 main elements: repeatability and reproducibility. Repeatability refers to the agreement between measurements under identical imaging conditions, that is, intrasubject scanning using identical scanning parameters, set-up, equipment, etc. Reproducibility refers to the degree to which features stay unchanged under various imaging conditions, for example, identical imaging parameters but different subjects, or different imaging parameters but the same subject. In this study, features fulfilling both of these requirements were classified as robust. The CoV was chosen as the figure of merit for robustness quantification since it allowed for a straightforward methodology to identify robust features within and between many subjects. It is defined as
$$\mathrm{CoV} = \frac{s}{|\mu|} \times 100\%$$
where s is the standard deviation and |µ| is the absolute value of the mean. CoV describes the dispersion of the data points, expressed as a percentage, where low values indicate high stability and vice versa.
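The CoV described above (standard deviation divided by the absolute value of the mean, expressed as a percentage) can be sketched in a few lines of Python. This is a minimal illustration, not the study's in-house code: it assumes the sample standard deviation (ddof=1), which the text does not specify, and the function name and example values are hypothetical.

```python
import numpy as np

def coefficient_of_variation(values, ddof=1):
    """Return the CoV in percent: 100 * s / |mean|.

    ddof=1 (sample standard deviation) is an assumption; the source
    text does not state which estimator was used.
    """
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    if mean == 0:
        raise ValueError("CoV is undefined when the mean is zero")
    return 100.0 * x.std(ddof=ddof) / abs(mean)

# Hypothetical repeated measurements of one feature in one VOI.
repeated = [10.2, 10.0, 9.9, 10.1, 9.8]
cov = coefficient_of_variation(repeated)
print(f"CoV = {cov:.2f}%, below 5% threshold: {cov < 5.0}")
```

Taking the absolute value of the mean keeps the CoV non-negative for features whose values are negative, which is why the definition uses |µ| rather than µ.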

Feature Extraction and Statistical Workflow
An in-house program, whose definitions are based on IBSI recommendations and those found in the work by Shafiq-ul-Hassan et al, 9 was used to extract 1085 shape-based, first-, second-, and higher-order features (Table 3). Shape-based features describe various geometric properties of the VOI, such as volume, compactness, and surface area. 6,10 First-order features relate to the voxel intensity distribution within the VOI, with no regard to relative spatial distribution. 5 Most of these features require intensity discretization of the 2D or 3D data before calculation. 6,10 Second-order statistics, also referred to as texture features, provide both intensity and spatial information. They describe the distribution of voxel intensity values between neighboring voxels along different directions and distances and are derived from so-called gray-tone-spatial-dependence matrices. 5,6,32 The matrices used in this work were the gray-level co-occurrence matrix (GLCM), the gray-level run-length matrix (GLRLM), the gray-level size zone matrix (GLSZM), and the neighborhood gray-tone difference matrix (NGTDM). A full description of how these matrices are defined and of the subsequent feature extraction based on a 26-connected region in 3D is given in the IBSI manual. 6 Lastly, the higher-order statistical features apply various noise-reduction or detail-identifying filters to the images before feature extraction. 5,7 The filter-based approaches used in this study were Laws', 33,34 wavelets, [35][36][37] Laplacian of Gaussian (LoG) filters, 5 and fractal analysis. 5,38
Repeatability and reproducibility were assessed with CoV < 5% as the threshold for feature robustness. Feature extraction was carried out for all imaging sessions and VOIs in each patient/phantom, followed by calculation of CoV. For both phantom datasets, each VOI was initially treated separately. The mean CoV over all VOIs (4 VOIs in the monthly phantom dataset and 2 in the daily) was then evaluated, and robust features (CoV < 5%) in each dataset were identified. A similar initial feature selection procedure was applied to the patient kidney and liver data, respectively, first calculating CoV for all features in each individual patient dataset. Robust features were then identified from the mean CoV over all patients within the kidney and liver datasets separately. Features fulfilling the robustness criterion in all 4 datasets were selected in the final step. Thus, the statistical workflow took into account both the repeatability and reproducibility criteria by examining intrasubject variability in the first step, followed by intersubject analysis, both between different patients and in the final feature selection across datasets.
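The selection workflow described above can be sketched as follows. This is a hedged illustration on synthetic data, not the authors' in-house program: the array layout, function names, feature names (`feat_a`, `feat_b`, `feat_c`), noise levels, and the use of the sample standard deviation are all assumptions made for the example. Only the 5% threshold, the number of VOIs, patients, and sessions, and the four dataset labels come from the text.

```python
import numpy as np

THRESHOLD = 5.0  # CoV threshold in percent, as stated in the text

def cov_percent(x, axis):
    """CoV in percent along one axis (sample standard deviation assumed)."""
    x = np.asarray(x, dtype=float)
    return 100.0 * x.std(axis=axis, ddof=1) / np.abs(x.mean(axis=axis))

def robust_features(data, names):
    """data has shape (n_subjects_or_vois, n_sessions, n_features).

    Step 1: CoV over sessions for each subject/VOI separately.
    Step 2: mean CoV over subjects/VOIs, then apply the threshold.
    """
    per_subject = cov_percent(data, axis=1)   # (n_subjects, n_features)
    mean_cov = per_subject.mean(axis=0)       # (n_features,)
    return {n for n, c in zip(names, mean_cov) if c < THRESHOLD}

# Synthetic stand-in data: three hypothetical features with different noise.
rng = np.random.default_rng(0)
names = ["feat_a", "feat_b", "feat_c"]
base = np.array([100.0, 50.0, 10.0])
rel_noise = np.array([0.01, 0.02, 0.30])

def simulate(n_subjects, n_sessions):
    noise = rng.normal(0.0, 1.0, size=(n_subjects, n_sessions, len(names)))
    return base * (1.0 + rel_noise * noise)

datasets = {
    "monthly_phantom": simulate(4, 11),   # 4 VOIs, 11 scans
    "daily_phantom": simulate(2, 11),     # 2 VOIs, 11 scans
    "patient_kidney": simulate(10, 5),    # 10 patients, 5 fractions
    "patient_liver": simulate(10, 5),
}

# Step 3: keep only features that are robust in all four datasets.
robust_per_dataset = {k: robust_features(v, names) for k, v in datasets.items()}
final_robust = set.intersection(*robust_per_dataset.values())
print(sorted(final_robust))
```

Averaging the per-subject CoV before thresholding, rather than pooling all measurements, keeps between-subject differences in the feature mean from inflating the variability estimate; the final set intersection is what enforces robustness across datasets.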

Discussion
Our literature review included both phantom and patient data analysis, as well as different approaches to investigating reproducibility and repeatability. Cattell et al 14 43 presented a summary of various MRI texture phantom analysis studies in which different materials for simulating tumor heterogeneity were used. Most designs consisted of solid structures, usually polystyrene spheres or porous foams embedded in an agarose gel mixture. However, sensitivity to temperature and humidity are 2 limitations to be overcome before such phantoms can be used in multicenter trials. The phantoms in our work were designed for QA and consisted of homogeneous structures giving rise to a close-to-binary signal. Future work could expand our analysis to texture phantoms similar to those found in the literature mentioned.
Radiomics is a fast-emerging area, and several studies on the subject have therefore been published since the time of our literature review. Sun et al 44 presented a recent phantom study on robustness analysis of images from a 1.5 T scanner of an integrated MRI-Linac. As in our results, they found a significant effect on feature variability from the test-retest cohort and therefore emphasize the importance of removing features that are sensitive to machine influence. No common robust features were identified between their work and ours. In another phantom study, Wong et al 45 investigated longitudinal feature repeatability on two 1.5 T scanners by acquiring 30 consecutive daily images of an ACR MRI phantom. Five of their repeatable shape-based features overlapped with our results, namely: maximum 3D diameter, sphericity, surface area, surface-to-volume ratio, and voxel volume. It should be noted that maximum 3D diameter and voxel volume were not identified in our literature review. Xue et al 46 investigated feature repeatability, reproducibility, and within-subject agreement in a clinical environment, looking at prostate cancer patients scanned on both a 1.5 T MRI-simulator and a 1.5 T MRI-Linac. Two robust features overlapped with our study: energy (wavelet LLL) and run-length nonuniformity (GLRLM). The authors conclude that a significantly smaller proportion of features pass the robustness criteria in their study, compared to a phantom study on the same MRI scanner and protocol. In agreement with our conclusions, they also emphasize the wider range of heterogeneity in patient data compared to phantoms.
We used an in-house developed program, based on the definitions given by IBSI, for feature extraction. However, studies show that features might be vulnerable to the choice of extraction software since calculation settings can vary. 12,47 Fornacon-Wood et al 47 compared the outcome between 4 platforms, 3 of which were IBSI-compliant, and concluded that the choice of program affects feature variability as well as the correlation of features to clinical outcome. McNitt-Gray et al 12 looked at the agreement between different radiomics software packages under controlled conditions using standardized radiomics feature definitions (using the IBSI manual). They concluded that high levels of agreement between packages were achieved for some of the features, while feature definitions requiring more complex derivations did not show the same levels of agreement. Thus, even when standard definitions are used, the choice of feature extraction software has an impact on the final determination, which should be taken into consideration when analyzing and comparing results. There is progress towards reaching common ground, but variations are still prevalent and remain a challenge for radiomics studies.
Another limitation of our analysis lies in the choice of cylindrical and spherical VOIs for phantom and patient (liver) data, respectively. These shapes do not have any unique long or short axis, which is of relevance for calculating many of the shape-based features. Volume and area are not affected, but it is worth considering that some shape-based features may lose their meaning in these datasets.
Gray-level normalization is recommended 11,20,21,25 before feature extraction and analysis to reduce the effects of using different scanners, protocols, and reconstruction parameters. As concluded by Lacroix et al, 25 image processing steps such as correction for magnetic field inhomogeneity and voxel value normalization are 2 of numerous aspects shown to affect feature outcome. The effect of gray-level normalization is further emphasized by Collewet et al. 11 In our study, each dataset was acquired with the same scanner and protocol. Since each dataset was analyzed separately before identifying common robust features among all data, normalization was omitted, as it was assumed that the system produced similar images under the same imaging conditions. In fact, our robustness analysis is temporal by design, in order to discern the potential effects of scanner drift on feature robustness. Interestingly, a recent study on a similar 0.35 T MRI-Linac system by Tomaszewski et al 22 looked at treatment response prediction for delta radiomics in pancreatic cancer patients and concluded that normalization reduces interscan signal variations as well as nonpathologic signal drift. They emphasize the importance of image preprocessing and robustness analysis before feature selection and present an explicit normalization method. We acknowledge that there may be many preprocessing techniques to improve feature robustness (SNR). Our assumption of no scanner drift is therefore a more conservative approach for the selection of robust features.
Our results indicate that 13 radiomics features overlapped between our analysis and those identified as predictive/prognostic in the literature review. Boldrini et al 23 studied a 0.35 T MRI-Linac system similar to the one in this work, and 9 common features could be identified. Although preliminary, this is a promising result suggesting a useful potential for radiomics studies on such a system across scanners and institutions. In another study on the same system by Tomaszewski et al, 22 several common features were identified in their robustness analysis, but no overlap was seen between their predictive features and our results; this can be expected since the test for robustness was completely different. The textural feature GLCM entropy has been characterized as a significant classifier for lesion discrimination in several studies as well as in stability assessment papers. These results are promising in that they identify radiomics features for further investigation. Although a large number of features were classified as robust in our work, a substantial proportion were not (88%). MRI-based radiomics stability assessment has been investigated only to a limited extent. Thus, even though efforts are being made to find common methods, no consensus on how to state feature robustness or predictive power currently exists. The situation where features are found to be predictive but not robust must be further investigated. We therefore stress the importance of reporting feature variability and further emphasize the relevance of robustness assessment as a first step before any clinical correlation is attempted.
This work assessed the robustness of radiomics features derived from a 0.35 T integrated MRI-Linac and provides a comprehensive and novel summary of longitudinal radiomics on such a system. We identified 130 robust features and conclude that certain radiomics features on images acquired with the low-field scanner of the system are stable over time. Phantom and human data were analyzed separately as a prior step, while the final analysis entailed a joint comparison and extraction of common robust features, which to our knowledge has not been performed on such a system before. Although no texture phantoms were used that reflect the complexity and wide range of gray levels observed in human tissue, the phantom analysis is valuable for representing ideal imaging conditions in a controlled experimental setting. Combined with patient data, it is therefore useful as an indication of variability solely due to inherent machine properties. Thus, it is in our future interest to develop a heterogeneous phantom to further explore and confirm feature behavior on a low-field MRI-Linac.

Conclusion
This work has explored the longitudinal robustness of radiomics features on a low-field integrated MRI-Linac and found that the 0.35 T scanner of the system is sufficiently stable over time for such analysis. Our results indicate that features that are robust over a wide range of imaging conditions can be identified in both phantom and patient data, and we emphasize the usefulness of phantom studies for feature stability assessment, as they provide a controlled setting. Developing a functional texture phantom for MRI-based radiomics would be of great interest in future studies. Furthermore, a literature review revealed that several of the features demonstrating a high level of stability in our analyses have also been found to be significantly related to various clinically relevant factors.

Ethics Statement
The Institutional Review Board at the University of South Florida approved (IRB #20383) and waived the informed consent requirement for retrospective analysis in this study.

Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded in part by the Crafoord Foundation travel grant (Sweden).